30  Chi-square Test

Overview

The Chi-square Test is a non-parametric statistical test used to examine whether distributions of categorical variables differ from one another. It is widely used to test the association between two categorical variables arranged in a contingency table. The test compares the observed frequency in each cell of the table with the expected frequency, calculated under the assumption that there is no association between the variables.

  • Chi-square Test of Independence: Used to determine if there is a significant association between two categorical variables.
  • Chi-square Goodness-of-Fit Test: Used to see if a sample distribution matches an expected distribution.

The Chi-square Test is applicable in a wide range of disciplines, including sociology, marketing, and education, for testing hypotheses about associations or differences in categorical data distributions.

30.1 Chi-square Test of Independence

Overview

The Chi-square Test of Independence is a non-parametric statistical test used to determine if there is a significant association between two categorical variables. This test assesses whether observed frequencies in a contingency table differ significantly from expected frequencies, which are calculated under the assumption of independence between the variables.

Null and Alternative Hypotheses

  • Null Hypothesis (H0): The null hypothesis states that there is no association between the two categorical variables; they are independent.
  • Alternative Hypothesis (H1): The alternative hypothesis suggests that there is a significant association between the two categorical variables.

Test Statistic

  • The test statistic for the Chi-square Test of Independence is calculated as follows:

    \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \(O_i\) represents the observed frequency, and \(E_i\) represents the expected frequency in each category.

  • The test statistic follows a chi-squared distribution with \((r-1)(c-1)\) degrees of freedom, where \(r\) is the number of rows and \(c\) is the number of columns in the contingency table.

Calculation of Expected Frequencies

  • Expected frequencies are calculated based on the marginal totals of the contingency table: \[ E_{ij} = \frac{(R_i \times C_j)}{N} \] where \(R_i\) is the total for row \(i\), \(C_j\) is the total for column \(j\), and \(N\) is the overall total number of observations.
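The formula above maps directly onto array broadcasting: multiplying the column of row totals by the row of column totals and dividing by the grand total yields the full matrix of expected frequencies at once. A minimal sketch using NumPy with a made-up 2×3 table (the counts are illustrative, not from the text):

```python
import numpy as np

# Hypothetical 2x3 contingency table of observed counts
observed = np.array([[12, 18, 20],
                     [28, 22, 30]])

row_totals = observed.sum(axis=1, keepdims=True)  # R_i, shape (2, 1)
col_totals = observed.sum(axis=0, keepdims=True)  # C_j, shape (1, 3)
N = observed.sum()                                # grand total

# E_ij = (R_i * C_j) / N, computed for every cell via broadcasting
expected = row_totals * col_totals / N

# Expected frequencies preserve the grand total
print(expected.sum())  # equals N
```

A useful sanity check is that the expected frequencies reproduce the same row, column, and grand totals as the observed table.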

Interpretation of Results

  • If the calculated \(\chi^2\) value is greater than the critical value from the chi-squared distribution at the chosen significance level (commonly \(\alpha = 0.05\)), the null hypothesis is rejected, indicating a significant association between the variables.
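Rather than consulting a printed table, the critical value can be obtained from the chi-squared quantile function; scipy.stats.chi2.ppf is one such implementation (the α and degrees of freedom below are illustrative):

```python
from scipy.stats import chi2

alpha = 0.05
df = 1  # e.g. a 2x2 table: (2 - 1) * (2 - 1) = 1

# Critical value: the (1 - alpha) quantile of the chi-squared distribution
critical_value = chi2.ppf(1 - alpha, df)
print(round(critical_value, 3))  # 3.841
```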

Applications

  • Sociology: To analyze the relationship between education level and employment status.
  • Medicine: To study the association between a risk factor (like smoking) and the incidence of a disease.
  • Marketing: To evaluate the relationship between customer demographics and product preferences.

30.1.1 Example problem on Chi-square Test of Independence

A researcher wants to determine if there is an association between gender (male, female) and preference for a new product (like, dislike). The data collected is as follows:

Gender   Like   Dislike
Male     20     10
Female   30     40

Null and Alternative Hypotheses

  • Null Hypothesis (H0): There is no association between gender and preference for the product; they are independent.
  • Alternative Hypothesis (H1): There is an association between gender and preference for the product.

Step-by-Step Calculation

  1. Observed Frequencies (O): The observed frequencies are given in the table:

    Gender          Like   Dislike   Row Totals
    Male            20     10        30
    Female          30     40        70
    Column Totals   50     50        100
  2. Expected Frequencies (E): The expected frequencies are calculated based on the assumption of independence. The expected frequency for each cell is calculated using the formula: \[ E_{ij} = \frac{(R_i \times C_j)}{N} \] where \(R_i\) is the row total, \(C_j\) is the column total, and \(N\) is the grand total.

    • For Male and Like: \[ E_{11} = \frac{(30 \times 50)}{100} = 15 \]

    • For Male and Dislike: \[ E_{12} = \frac{(30 \times 50)}{100} = 15 \]

    • For Female and Like: \[ E_{21} = \frac{(70 \times 50)}{100} = 35 \]

    • For Female and Dislike: \[ E_{22} = \frac{(70 \times 50)}{100} = 35 \]

    The expected frequencies are:

    Gender   Like (E)   Dislike (E)
    Male     15         15
    Female   35         35
  3. Chi-square Test Statistic: The test statistic is calculated using the formula: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]

    • For Male and Like: \[ \chi^2_{11} = \frac{(20 - 15)^2}{15} = \frac{25}{15} = 1.67 \]

    • For Male and Dislike: \[ \chi^2_{12} = \frac{(10 - 15)^2}{15} = \frac{25}{15} = 1.67 \]

    • For Female and Like: \[ \chi^2_{21} = \frac{(30 - 35)^2}{35} = \frac{25}{35} = 0.71 \]

    • For Female and Dislike: \[ \chi^2_{22} = \frac{(40 - 35)^2}{35} = \frac{25}{35} = 0.71 \]

    The total chi-square statistic is: \[ \chi^2 = 1.67 + 1.67 + 0.71 + 0.71 = 4.76 \]

  4. Degrees of Freedom (df): The degrees of freedom for the test are calculated as: \[ df = (r - 1) \times (c - 1) \] where \(r\) is the number of rows and \(c\) is the number of columns. In this case: \[ df = (2 - 1) \times (2 - 1) = 1 \]

  5. Critical Value and P-value: The critical value for \(\chi^2\) at \(\alpha = 0.05\) and 1 degree of freedom can be found in chi-square distribution tables. The critical value is 3.841.

    Compare the calculated \(\chi^2\) value with the critical value:

    • If \(\chi^2 > 3.841\), reject the null hypothesis.
    • Otherwise, do not reject the null hypothesis.

    In this case, \(\chi^2 = 4.76\), which is greater than 3.841, so we reject the null hypothesis.

    Alternatively, you can calculate the p-value using a chi-square distribution calculator or software. For \(\chi^2 = 4.76\) with 1 degree of freedom, the p-value is approximately 0.029. Note that this hand calculation applies no continuity correction; statistical software typically applies Yates' continuity correction to 2×2 tables by default, which yields \(\chi^2 \approx 3.857\) and a p-value of about 0.0495, leading to the same conclusion at \(\alpha = 0.05\).
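The hand calculation can be checked with scipy.stats.chi2_contingency; that function applies Yates' continuity correction to 2×2 tables by default, so correction=False is needed to reproduce the uncorrected statistic computed above:

```python
import numpy as np
from scipy.stats import chi2_contingency

data = np.array([[20, 10],
                 [30, 40]])

# correction=False disables Yates' continuity correction,
# matching the uncorrected hand calculation
chi2_stat, p_value, dof, expected = chi2_contingency(data, correction=False)

print(round(chi2_stat, 2))  # 4.76
print(round(p_value, 3))    # 0.029
print(dof)                  # 1
```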

Interpretation

Since the p-value (0.029) is less than the significance level (\(\alpha = 0.05\)), we reject the null hypothesis. There is sufficient evidence to conclude that there is a significant association between gender and preference for the new product.

R Code for Chi-square Test of Independence

Code
# Data for the Chi-square Test of Independence
data <- matrix(c(20, 10, 30, 40), nrow = 2, byrow = TRUE)

# Perform the test (chisq.test applies Yates' continuity
# correction by default for 2x2 tables)
result <- chisq.test(data)

# Output the result
print(result)

    Pearson's Chi-squared test with Yates' continuity correction

data:  data
X-squared = 3.8571, df = 1, p-value = 0.04953

Python Code for Chi-square Test of Independence

Code
import numpy as np
from scipy.stats import chi2_contingency

# Data for the Chi-square Test of Independence
data = np.array([[20, 10], [30, 40]])

# Perform the test (chi2_contingency applies Yates' continuity
# correction by default for 2x2 tables)
chi2, p, dof, expected = chi2_contingency(data)

# Output the result
print(f"Chi2 Statistic: {chi2}")
Chi2 Statistic: 3.8571428571428577
Code
print(f"P-value: {p}")
P-value: 0.04953461343562668
Code
print(f"Degrees of Freedom: {dof}")
Degrees of Freedom: 1
Code
print("Expected Frequencies:")
Expected Frequencies:
Code
print(expected)
[[15. 15.]
 [35. 35.]]

30.2 Chi-square Goodness-of-Fit Test

Overview

The Chi-square Goodness-of-Fit Test is used to determine whether a sample distribution matches an expected distribution. This test compares the observed frequencies of categories to the frequencies expected under a specified theoretical distribution.

Null and Alternative Hypotheses

  • Null Hypothesis (H0): The null hypothesis states that the sample distribution matches the expected distribution.
  • Alternative Hypothesis (H1): The alternative hypothesis suggests that there is a significant difference between the observed and expected distributions.

Test Statistic

  • The test statistic for the Chi-square Goodness-of-Fit Test is calculated as follows: \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \] where \(O_i\) represents the observed frequency, and \(E_i\) represents the expected frequency for each category.

  • The test statistic follows a chi-squared distribution with \(k-1\) degrees of freedom, where \(k\) is the number of categories.

Calculation of Expected Frequencies

  • Expected frequencies are calculated based on the theoretical distribution. For example, if testing a uniform distribution, the expected frequency for each category would be equal.

Interpretation of Results

  • If the calculated \(\chi^2\) value is greater than the critical value from the chi-squared distribution at the chosen significance level (commonly \(\alpha = 0.05\)), the null hypothesis is rejected, indicating that the sample distribution significantly differs from the expected distribution.

Applications

  • Genetics: To determine if the observed frequencies of different genotypes match the expected frequencies under Mendelian inheritance.
  • Quality Control: To check if the observed defect rates in different categories match the expected rates.
  • Survey Analysis: To see if the distribution of survey responses matches the expected distribution based on population proportions.

30.2.1 Example problem on Chi-square Goodness-of-Fit Test

A company wants to know if the observed sales distribution across four regions (North, South, East, West) matches their expected distribution. The expected distribution is equal across all regions. The observed sales are as follows:

Region   Observed Sales
North    50
South    60
East     40
West     50

The company will use the Chi-square Goodness-of-Fit Test to analyze the data.

Null and Alternative Hypotheses

  • Null Hypothesis (H0): The observed sales distribution matches the expected distribution (equal across all regions).
  • Alternative Hypothesis (H1): The observed sales distribution does not match the expected distribution.

Step-by-Step Calculation

  1. Observed Frequencies (O): The observed frequencies are given in the table:

    Region   Observed Sales (O)
    North    50
    South    60
    East     40
    West     50
  2. Expected Frequencies (E): The expected frequencies are calculated based on the assumption that the sales are equally distributed across all regions. Since the total number of observations is 200 (50 + 60 + 40 + 50 = 200) and there are four regions, the expected frequency for each region is:

    \[ E = \frac{\text{Total Sales}}{\text{Number of Regions}} = \frac{200}{4} = 50 \]

    So, the expected frequencies are:

    Region   Expected Sales (E)
    North    50
    South    50
    East     50
    West     50
  3. Chi-square Test Statistic: The test statistic is calculated using the formula:

    \[ \chi^2 = \sum \frac{(O_i - E_i)^2}{E_i} \]

    • For North: \[ \chi^2_{North} = \frac{(50 - 50)^2}{50} = \frac{0}{50} = 0 \]

    • For South: \[ \chi^2_{South} = \frac{(60 - 50)^2}{50} = \frac{100}{50} = 2 \]

    • For East: \[ \chi^2_{East} = \frac{(40 - 50)^2}{50} = \frac{100}{50} = 2 \]

    • For West: \[ \chi^2_{West} = \frac{(50 - 50)^2}{50} = \frac{0}{50} = 0 \]

    The total chi-square statistic is:

    \[ \chi^2 = 0 + 2 + 2 + 0 = 4 \]

  4. Degrees of Freedom (df): The degrees of freedom for the test are calculated as:

    \[ df = k - 1 \]

    where \(k\) is the number of categories (regions). In this case:

    \[ df = 4 - 1 = 3 \]

  5. Critical Value and P-value: The critical value for \(\chi^2\) at \(\alpha = 0.05\) and 3 degrees of freedom can be found in chi-square distribution tables. The critical value is 7.815.

    Compare the calculated \(\chi^2\) value with the critical value:

    • If \(\chi^2 > 7.815\), reject the null hypothesis.
    • Otherwise, do not reject the null hypothesis.

    In this case, \(\chi^2 = 4\), which is less than 7.815, so we do not reject the null hypothesis.

    Alternatively, you can calculate the p-value using a chi-square distribution calculator or software. For \(\chi^2 = 4\) with 3 degrees of freedom, the p-value is approximately 0.261.
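The critical value and p-value used in steps 4 and 5 can be recomputed directly from the chi-squared distribution; scipy.stats.chi2 provides the quantile (ppf) and survival (sf) functions used here:

```python
from scipy.stats import chi2

chi2_stat = 4.0  # test statistic from the worked example
df = 3           # k - 1 = 4 - 1

critical_value = chi2.ppf(0.95, df)  # 95th percentile of chi2(3)
p_value = chi2.sf(chi2_stat, df)     # P(X >= 4.0) for X ~ chi2(3)

print(round(critical_value, 3))  # 7.815
print(round(p_value, 3))         # 0.261
```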

Interpretation

Since the p-value (0.261) is greater than the significance level (\(\alpha = 0.05\)), we do not reject the null hypothesis. There is insufficient evidence to conclude that the observed sales distribution significantly differs from the expected distribution.

R Code for Chi-square Goodness-of-Fit Test

Code
# Observed sales
observed <- c(50, 60, 40, 50)

# Expected sales (equal distribution)
expected <- rep(sum(observed) / length(observed), length(observed))

# Perform the test
result <- chisq.test(observed, p = expected / sum(expected))

# Output the result
print(result)

    Chi-squared test for given probabilities

data:  observed
X-squared = 4, df = 3, p-value = 0.2615

Python Code for Chi-square Goodness-of-Fit Test

Code
import numpy as np
from scipy.stats import chisquare

# Observed sales
observed = np.array([50, 60, 40, 50])

# Expected sales (equal distribution)
expected = np.full(len(observed), np.mean(observed))

# Perform the test
chi2, p = chisquare(f_obs=observed, f_exp=expected)
# Output the result
print(f"Chi2 Statistic: {chi2}")
Chi2 Statistic: 4.0
Code
print(f"P-value: {p}")
P-value: 0.26146412994911034
Code
print("Observed Frequencies:")
Observed Frequencies:
Code
print(observed)
[50 60 40 50]
Code
print("Expected Frequencies:")
Expected Frequencies:
Code
print(expected)
[50. 50. 50. 50.]